Red Wine Data Analysis by Monica London

The purpose of this analysis is to explore a dataset featuring characteristics about red wine.

Summary

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Structure

str(RedWine)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The red wine data set contains 1,599 observations of 13 variables, including the row index X.

Univariate Plots Section

We can see the distribution of quality ratings has a minimum of 3 and a maximum of 8, with most ratings at 5 or 6. Surprisingly, there are no ratings of 1, 2, 9, or 10. I would have expected a larger range of quality ratings with such a large data set.

I divided the data by quality level, with 0-3, 4-6, and 7-10 being the three levels. We can see that the vast majority of observations fall in the medium quality level.

We can see that the Density and pH plots are the most normally distributed. The majority of pH levels fall between 3.0 and 3.5. Many of the plots are skewed to the right, including Free Sulfur Dioxide and Total Sulfur Dioxide. The majority of wines have less than 100 in total sulfur dioxide. Several of the plots are long tailed, such as Residual Sugar and Chlorides.

The plots above compare the distributions before and after transformation. The data for residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide becomes more normally distributed after applying log10.

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol, and quality). All of the variables are numeric with the exception of quality, which is in integer form.

Other observations:

Most of the wines have a quality of 5 or 6.

The 3rd quartile of residual sugar levels is 2.6, although there are a few major outliers, with the maximum residual sugar level of 15.5. I’m interested to see if higher residual sugar wines tend to have lower or higher quality.

Most wines have an alcohol content of less than 12%. This surprises me, given that the majority of red wines I’m familiar with have alcohol contents above 13.5%.

Many of the wines have 0 citric acid.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the dataset is quality, and I’d like to determine which variables impact quality ratings the most. I suspect alcohol, residual sugar, and pH contribute to quality ratings, as they seem to be features you may be able to detect during wine tastings.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

From research into what contributes to the taste of wine, I discovered that sweetness, acidity, tannin, alcohol, and body are the main features. In addition to pH, I think fixed acidity and volatile acidity may contribute to the acidity of wine.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new variable called quality level, which cuts the quality levels into low (3, 4), medium (5, 6), and high (7,8).
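The binning step can be sketched with base R's cut(). This is a minimal illustration, assuming the low (3, 4), medium (5, 6), high (7, 8) grouping described above; the vector holds only the first ten quality ratings shown in the str() output, and the names quality_level and level_counts are illustrative rather than taken from the original analysis.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Bin ratings into ordered levels. With the default right = TRUE the
# intervals are (2,4] -> low, (4,6] -> medium, (6,8] -> high.
quality_level <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

# Count wines per level (illustrative sample only, not the full data set).
level_counts <- table(quality_level)
print(level_counts)
```

Making the factor ordered keeps low < medium < high in plots and comparisons, which matters when the levels are used on a boxplot axis.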

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I deleted column X because it was simply a repeat of the index.
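A quick way to justify dropping such a column is to check that it really matches the row index first. The sketch below uses a toy five-row data frame built from values in the str() output; it stands in for the full RedWine data frame, and x_is_index is an illustrative name.

```r
# Toy data frame standing in for the first rows of RedWine (values from str()).
RedWine <- data.frame(
  X       = 1:5,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Confirm X only repeats the row index before discarding it.
x_is_index <- all(RedWine$X == seq_len(nrow(RedWine)))

# Drop the redundant column.
if (x_is_index) {
  RedWine$X <- NULL
}

print(names(RedWine))
```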

I applied log10 to residual sugar, chlorides, free sulfur dioxide, and total sulfur dioxide in order to normalize the distributions.
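One way to check whether a log10 transform actually reduces right skew is to compare a skewness statistic before and after. The sketch below is an assumption-laden illustration: it uses only the first ten residual sugar values from the str() output, and skewness() is a small helper defined here (a moment-based estimate), not a function from the original analysis.

```r
# First ten residual sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based sample skewness: third central moment over sd^3.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

cat("skewness before log10:", round(skew_raw, 2), "\n")
cat("skewness after log10: ", round(skew_log, 2), "\n")
```

The transform only works for strictly positive values. A variable containing zeros, such as citric acid, would produce -Inf under log10.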

Bivariate Plots Section

I analyzed the following bivariate relationships:

Quality vs. Alcohol

Quality vs. pH

Quality vs. Residual Sugar

Quality vs. Fixed Acidity

Quality vs. Volatile Acidity

Residual Sugar vs. Alcohol

Residual Sugar vs. pH

pH vs. Alcohol

Fixed Acidity vs. Density

Fixed Acidity vs. pH

pH vs. Citric Acid

Quality Level vs. Alcohol

Quality Level vs. pH

Quality Level vs. Residual Sugar

The correlogram indicates that most pairs of variables are not highly correlated. The strongest relationships appear to be density vs. fixed acidity (r = 0.67), citric acid vs. fixed acidity (r = 0.67), pH vs. fixed acidity (r = -0.68), and total sulfur dioxide vs. free sulfur dioxide (r = 0.67). A correlation between citric acid and fixed acidity is not surprising as they are both acids. Free sulfur dioxide is part of total sulfur dioxide, so a correlation is expected. pH is a measure of acidity, so the correlation between pH and fixed acidity is not surprising either. I am unsure what would cause a correlation between density and fixed acidity, but it could be that a more acidic liquid is denser than a less acidic one.
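Because some of the reported values are negative, they behave like Pearson correlation coefficients (r) rather than squared values, which can never be negative. The sketch below shows the distinction using base R's cor(); it uses only the first ten fixed acidity and pH values from the str() output, so the numbers it prints will not match the full-data correlogram.

```r
# First ten values of each variable, copied from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# cor() returns the signed Pearson coefficient r by default.
r <- cor(fixed_acidity, pH)

# Squaring r discards the sign and gives the share of variance explained
# by a simple linear fit of one variable on the other.
r_squared <- r^2

cat("r   =", round(r, 3), "\n")
cat("r^2 =", round(r_squared, 3), "\n")
```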

It is interesting to see how the alcohol level tends to be much higher in the higher quality wines than the medium or low quality wines.

The median pH level decreases as the quality increases. The data is also more compact at the highest quality level. As the quality level increases, the pH range decreases.

The range of outliers (plotted in red) is large in this plot, especially in the medium quality level. All of the outliers in all quality levels are high outliers; they have very high levels of residual sugar rather than very low levels of residual sugar.
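Boxplot outliers in R are, by default, the points lying more than 1.5 times the box height beyond the hinges. The sketch below applies that rule with boxplot.stats() to the first ten residual sugar values from the str() output, then splits the flagged points into high and low outliers; it assumes the plot used the default rule, and the variable names are illustrative.

```r
# First ten residual sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# boxplot.stats() uses coef = 1.5 by default, the same rule a boxplot draws.
stats      <- boxplot.stats(residual_sugar)
outliers   <- stats$out
box_median <- stats$stats[3]   # stats$stats holds the five-number summary

# Separate outliers above the median from those below it.
high_outliers <- outliers[outliers > box_median]
low_outliers  <- outliers[outliers < box_median]

cat("high outliers:", high_outliers, "\n")
cat("number of low outliers:", length(low_outliers), "\n")
```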

Both the IQR and the median of volatile acidity decrease as quality level increases.
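The per-level median and IQR read off a boxplot can also be computed directly. This is a small sketch with tapply(), assuming the low (3, 4), medium (5, 6), high (7, 8) grouping; it uses only the first ten rows from the str() output, which contain no low-quality wines, so that level comes back as NA.

```r
# First ten rows of the two relevant columns, copied from the str() output.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed grouping: low (3, 4), medium (5, 6), high (7, 8).
wine$quality.level <- cut(wine$quality,
                          breaks = c(2, 4, 6, 8),
                          labels = c("low", "medium", "high"))

# Median and interquartile range of volatile acidity within each level.
medians <- tapply(wine$volatile.acidity, wine$quality.level, median)
iqrs    <- tapply(wine$volatile.acidity, wine$quality.level, IQR)

print(medians)
print(iqrs)
```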

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The main feature of interest in this analysis is quality, and whether any features show an effect on quality. The correlogram shows that the strongest correlation between quality and any other feature is with alcohol (r = 0.48). While you could see this trend in the alcohol vs. quality scatterplot, the extreme number of quality ratings of 5 and 6 made it difficult to see the true relationship between alcohol and quality in the higher quality wines. The quality level vs. alcohol boxplot shows this relationship much better, with a clear increase in median alcohol levels in the highest quality wines.

The relationship between quality and pH has a correlation coefficient of only about 0.06 in magnitude, which indicates practically zero correlation, and the corresponding scatterplot confirms this. However, when pH is compared to quality levels, there is a pattern in the boxplot. The median pH levels seem to decrease as the quality level increases, especially between the lower quality and medium quality wines.

The correlogram indicates that there is no correlation (r = 0.01) between residual sugar and quality. Even after transforming the residual sugar data using log10 and plotting it against quality levels, there seems to be no clear relationship between residual sugar and quality.

One of the most striking bivariate relationships discovered was the relationship between quality level and volatile acidity. The correlogram shows a correlation of r = -0.39 between quality and volatile acidity. The scatterplot shows this moderate correlation, but the relationship is much clearer when the data is grouped by quality level in the volatile acidity vs. quality level boxplot.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Some of the most interesting relationships were between variables that were not the main feature of interest. In fact, three of the four strongest correlations involved fixed acidity and another variable. Fixed acidity had the strongest correlations with density, citric acid, and pH. As discussed previously, this is not surprising given that many of the variables are either acids themselves or a measure of acidity.

What was the strongest relationship you found?

The strongest relationship, according to the correlation coefficient, is between pH and fixed acidity (r = -0.68). However, once the data was cut into quality levels, the plots indicate that there are strong relationships between quality level and alcohol, quality level and volatile acidity, and quality level and pH.

Multivariate Plots Section

Because the majority of the data has a medium quality level, the data is highly clustered. The smoother shows a slightly higher fixed acidity vs. density ratio for higher quality level wines vs. medium or lower quality level wines.

This plot doesn’t show strong trends, but it does show how the majority of the data falls in the lower alcohol, lower pH quadrant of the chart.

This plot shows some differences in the relationship between pH and citric acid by quality level.

The relationship between pH and fixed acidity seems to be uniform across all quality levels.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = RedWine)
## m2: lm(formula = quality ~ alcohol + pH, data = RedWine)
## m3: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)), 
##     data = RedWine)
## m4: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar, data = RedWine)
## m5: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity, data = RedWine)
## m6: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity, data = RedWine)
## m7: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)), 
##     data = RedWine)
## m8: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide, data = RedWine)
## m9: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide + citric.acid, data = RedWine)
## m10: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide + citric.acid + I(log10(chlorides)), 
##     data = RedWine)
## m11: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide + citric.acid + I(log10(chlorides)) + 
##     I(log10(free.sulfur.dioxide)), data = RedWine)
## m12: lm(formula = quality ~ alcohol + pH + I(log10(residual.sugar)) + 
##     residual.sugar + fixed.acidity + volatile.acidity + I(log10(total.sulfur.dioxide)) + 
##     total.sulfur.dioxide + citric.acid + I(log10(chlorides)) + 
##     I(log10(free.sulfur.dioxide)) + sulphates, data = RedWine)
## 
## ==========================================================================================================================================================================================================
##                                        m1            m2            m3            m4            m5            m6            m7            m8            m9           m10           m11           m12       
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                         1.875***      4.426***      4.526***      4.539***      3.014***      3.559***      3.971***      3.834***      3.901***      3.871***      4.036***      3.359***  
##                                      (0.175)       (0.387)       (0.393)       (0.393)       (0.601)       (0.576)       (0.607)       (0.605)       (0.606)       (0.605)       (0.610)       (0.606)    
##   alcohol                             0.361***      0.386***      0.389***      0.391***      0.386***      0.330***      0.321***      0.325***      0.330***      0.321***      0.315***      0.292***  
##                                      (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.017)       (0.018)       (0.018)       (0.018)    
##   pH                                               -0.850***     -0.870***     -0.872***     -0.506**      -0.259        -0.288        -0.426**      -0.456**      -0.513**      -0.524**      -0.487**   
##                                                    (0.116)       (0.116)       (0.116)       (0.159)       (0.154)       (0.154)       (0.157)       (0.158)       (0.161)       (0.160)       (0.158)    
##   I(log10(residual.sugar))                                       -0.171        -0.495        -0.728*       -0.199        -0.120        -0.142        -0.127        -0.084        -0.031         0.057     
##                                                                  (0.114)       (0.313)       (0.320)       (0.309)       (0.311)       (0.309)       (0.309)       (0.310)       (0.310)       (0.305)    
##   residual.sugar                                                                0.038         0.059         0.012         0.009         0.020         0.020         0.018         0.011         0.009     
##                                                                                (0.034)       (0.035)       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)       (0.033)    
##   fixed.acidity                                                                               0.047***      0.023         0.018         0.009         0.022         0.018         0.018         0.017     
##                                                                                              (0.014)       (0.014)       (0.014)       (0.014)       (0.016)       (0.016)       (0.016)       (0.016)    
##   volatile.acidity                                                                                         -1.249***     -1.255***     -1.233***     -1.332***     -1.282***     -1.253***     -1.114***  
##                                                                                                            (0.101)       (0.101)       (0.100)       (0.117)       (0.120)       (0.120)       (0.120)    
##   I(log10(total.sulfur.dioxide))                                                                                         -0.124*        0.424**       0.405**       0.429**       0.210         0.103     
##                                                                                                                          (0.058)       (0.144)       (0.145)       (0.145)       (0.178)       (0.175)    
##   total.sulfur.dioxide                                                                                                                 -0.006***     -0.005***     -0.006***     -0.005***     -0.004**   
##                                                                                                                                        (0.001)       (0.001)       (0.001)       (0.001)       (0.001)    
##   citric.acid                                                                                                                                        -0.234        -0.170        -0.134        -0.226     
##                                                                                                                                                      (0.142)       (0.146)       (0.147)       (0.145)    
##   I(log10(chlorides))                                                                                                                                              -0.248        -0.244        -0.552***  
##                                                                                                                                                                    (0.132)       (0.131)       (0.135)    
##   I(log10(free.sulfur.dioxide))                                                                                                                                                   0.203*        0.198*    
##                                                                                                                                                                                  (0.095)       (0.094)    
##   sulphates                                                                                                                                                                                     0.813***  
##                                                                                                                                                                                                (0.108)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                           0.227         0.252         0.253         0.254         0.259         0.324         0.326         0.333         0.334         0.336         0.338         0.361     
##   adj. R-squared                      0.226         0.251         0.252         0.252         0.257         0.322         0.323         0.330         0.331         0.332         0.333         0.356     
##   sigma                               0.710         0.699         0.699         0.699         0.696         0.665         0.664         0.661         0.661         0.660         0.659         0.648     
##   F                                 468.267       268.888       180.161       135.446       111.286       127.233       109.959        99.324        88.685        80.298        73.574        74.568     
##   p                                   0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood                  -1721.057     -1694.466     -1693.325     -1692.710     -1687.117     -1613.455     -1611.152     -1602.601     -1601.237     -1599.456     -1597.172     -1568.959     
##   Deviance                          805.870       779.508       778.397       777.799       772.376       704.393       702.367       694.895       693.711       692.167       690.192       666.261     
##   AIC                              3448.114      3396.931      3396.650      3397.421      3388.234      3242.909      3240.303      3225.203      3224.475      3222.913      3220.344      3165.918     
##   BIC                              3464.245      3418.440      3423.536      3429.684      3425.874      3285.926      3288.697      3278.974      3283.623      3287.438      3290.247      3241.198     
##   N                                1599          1599          1599          1599          1599          1599          1599          1599          1599          1599          1599          1599         
## ==========================================================================================================================================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

When looking at pH vs. citric acid in terms of quality levels, there does seem to be a relationship. For any given pH, the citric acid level appears to increase as the quality level increases. This relationship isn’t as clear as the pH level increases above 3.5. This may be because of fewer data points at that level.

In the density vs. fixed acidity in terms of quality level plot, there is a relationship. For a given fixed acidity level below 12, the average density level is higher for lower quality wines than higher quality wines.

Were there any interesting or surprising interactions between features?

In the density vs. fixed acidity in terms of quality level plot, the smoother for the lowest quality wines does not appear linear. It almost appears logarithmic, with density rising more slowly as the fixed acidity value increases.

Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a model to predict quality with several variables, including alcohol, pH, residual sugar, fixed acidity, volatile acidity, total sulfur dioxide, citric acid, chlorides, free sulfur dioxide, and sulphates. The R-squared value of the full model (m12) is 0.361, meaning it explains about 36% of the variance in quality. Because quality ratings are chosen by humans and are not scientific, an R-squared value of 0.361 is relatively strong.

The number of observations with 3, 4, 7, and 8 quality ratings is much lower than the number of observations with 5 and 6 quality ratings. A bigger overall data set and more observations in the lower and higher quality ratings would improve the model.
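The approach of adding one predictor at a time and comparing fit statistics can be sketched in base R with lm() and update(). This is a minimal illustration on the first ten rows from the str() output, with only two predictors; the data frame name wine and the table fit_stats are illustrative, and with so few rows the numbers say nothing about the full-data models in the table above.

```r
# First ten rows of three columns, copied from the str() output above.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Start with one predictor, then add another to form a nested model.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + volatile.acidity)

# Collect the fit statistics side by side.
fit_stats <- data.frame(
  model         = c("m1", "m2"),
  r.squared     = c(summary(m1)$r.squared,     summary(m2)$r.squared),
  adj.r.squared = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  AIC           = c(AIC(m1), AIC(m2))
)
print(fit_stats)
```

Plain R-squared can never decrease when a predictor is added to a nested model, so adjusted R-squared and AIC are the more useful columns for deciding whether an extra variable earns its place.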

Final Plots and Summary

Plot One

Description One

This plot shows a similar median alcohol level for both low and medium quality wines. Interestingly, the median alcohol level is much higher for the highest quality wines.

Plot Two

Description Two

This plot indicates that the median pH level steadily decreases as the quality level increases.

Plot Three

Description Three

This plot is notable because it visualizes the relationship between the two variables with the strongest correlation in the data set. Fixed acidity and pH have a correlation coefficient of r = -0.68. This relationship makes a lot of sense because pH is a measure of acidity; a lower pH indicates a substance is more acidic and a higher pH indicates a substance is more basic. This plot confirms this relationship.

Reflection

I wanted to find out which variables most impacted quality. There were several insights I found during the exploration of this data set.

Alcohol and Quality: There is a clear relationship between quality level and alcohol content, but only for the highest quality wines. This relationship is unclear until quality is separated into quality levels.

pH and Quality: There is a negative relationship between pH and quality levels. This is unclear until quality is separated into quality levels.

Volatile Acidity and Quality: There is a very strong relationship between volatile acidity and quality levels. The relationship is only moderate when quality is not divided into levels.

The correlogram was very helpful in showing correlations between variables except for quality. In several cases, the relationship between quality and a specific variable was unclear until quality was separated into quality levels.

It would be most helpful to know the type of red wine, such as Cabernet Sauvignon, Pinot Noir, Merlot, etc. It is very difficult to analyze trends when the type of the wine is unknown. For example, a certain wine type may be expected to have more alcohol, so a rater might score it more favorably than a wine type that is expected to have a lower alcohol content. Furthermore, it would be interesting to analyze wines from different parts of the world to see if there is a relationship between region and quality or any of the other variables.